[57] ABSTRACT

A routable operand and selectable operation processor multimedia extension unit is employed to motion compensate MPEG video using improved vector processing. A vector processing unit executes an add and divide instruction that adds two vector registers and divides the result in a single instruction. This is implemented through loading a first vector register with a first plurality of elements from a source block. A second vector register is then loaded with a second plurality of elements that are adjacent to the first plurality of elements. The add and divide instruction is then executed on the first and second vector registers, yielding an interpolated source element that is stored in a resultant vector register.
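The sequence described in the abstract can be sketched in a few lines of Python. This is only an illustrative model, not the microfiche program: the function names are hypothetical, Python lists stand in for the vector registers, partition widths are ignored, and the divide is assumed to be a truncating right shift by one bit with no rounding term.

```python
def add_and_divide(va, vb):
    """Model one add-and-divide step: per-element (a + b) >> 1.

    Assumption: the divide is a truncating one-bit right shift.
    """
    if len(va) != len(vb):
        raise ValueError("vector registers must hold the same number of elements")
    return [(a + b) >> 1 for a, b in zip(va, vb)]


def interpolate_row(source_row, width):
    """Interpolate `width` elements of one source-block row horizontally."""
    first = source_row[0:width]        # first plurality of elements
    second = source_row[1:width + 1]   # the elements adjacent to them
    return add_and_divide(first, second)


row = [10, 20, 30, 41, 50]
print(interpolate_row(row, 4))  # [15, 25, 35, 45]
```

Each output element is the average of a source element and its right-hand neighbor, so a row of `width + 1` source elements yields `width` interpolated elements.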
MPEG MOTION COMPENSATION USING OPERAND ROUTING AND PERFORMING ADD AND DIVIDE IN A SINGLE INSTRUCTION

REFERENCE TO A MICROFICHE APPENDIX

The specification includes a software program submitted as a microfiche appendix. The microfiche appendix includes one (1) microfiche which has sixteen (16) frames.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The invention pertains to parallel algorithms for execution by an operand-rerouting, multi-operation vector processor. More specifically, the invention relates to an improved MPEG motion compensation technique on such a processor.

2. Description of the Related Art

The microcomputer industry has seen a metamorphosis in the way computers are used over the last number of years. Originally, most operating systems were text based, requiring typed user input and providing textual response. These systems have given way to graphical based environments. Current systems are heavily graphically based, both providing graphical user interfaces including icons, windows, and the like, and providing graphical interaction with a user through a variety of user input devices.

This trend is likely to continue. But graphical, multimedia environments place different and greater demands on processor capabilities than the old textual environments. For many years, the Intel x86 series of processors by Intel Corporation has provided the computing power for IBM PC compatible machines. The architecture of the Intel design, however, is not optimized towards graphical operations.

To this end, a number of extensions to the x86 architecture have been proposed and developed. These include the MMX extensions developed by Intel Corporation. Further, other manufacturers have similarly extended their instruction sets. For example, Sun Microsystems has developed the UltraSparc, a graphics extension of the SPARC V9 architecture.

Typical vector processors provide for multiple operations simultaneously, but require that the same operation be performed by each partition within the vector (SIMD, or single instruction multiple data). In the multimedia extension unit architecture, this has changed. Not only can multiple operations be concurrently executed on vectorized data, but different operations can be simultaneously performed, and the vectorized data can be rerouted through a number of multiplexers.

This architecture presents a number of possibilities, but developing algorithms that efficiently utilize this architecture places its own demands, given the new features of the instruction set. It is desirable to efficiently utilize this architecture to execute algorithms for multimedia.

One particular area of interest in multimedia environments is the decompression and playback of real time video. A principal standard which has been developed by industry is "MPEG," which provides for both coding and decoding of real time video. In the various MPEG techniques, which are described in more detail in various MPEG standards for which various routines have been developed by the MPEG Software Simulation Group, a key technique is known as motion estimation and compensation. In this technique, "macro blocks" of a source frame are shifted for a resulting destination frame. As part of this shifting, interpolation may be needed between adjacent pixels of a source frame. Any improvements in efficiency of such motion compensation are greatly desirable.

SUMMARY OF THE INVENTION

According to the invention, a multimedia extension unit architecture performs MPEG motion compensation for graphical display through new, faster, and unique techniques. The MPEG algorithm is highly vectorized. Specifically, MPEG motion compensation requires the interpolation of adjacent pixels. This interpolation is performed in a single instruction, which both adds two adjacent pixels and divides the result by two, yielding the interpolated pixel.

BRIEF DESCRIPTION OF THE DRAWINGS

A better understanding of the present invention can be obtained when the following detailed description of the preferred embodiment is considered in conjunction with the following drawings, in which:

FIG. 1 is a block diagram of a computer system having a processor and a multimedia extension unit of the present invention;

FIG. 2 shows a micro-architecture of the processor and the multimedia enhanced unit of FIG. 1;

FIG. 3 is a more detailed block diagram of the multimedia extension unit of FIG. 2;

FIG. 4 shows in more detail an operand router unit of FIG. 3;

FIG. 5 is a diagram illustrating motion compensation as used according to the MPEG standard;

FIG. 6 is a flowchart illustration of a MEURECON_COMP routine for motion compensation according to the invention;

FIG. 7 is a flowchart illustration of a HORIZ_INTERPOLATION ONLY routine used in the routine of FIG. 6;

FIG. 8 is a flowchart illustration of a WIDTH16_HORIZINT1_LOOP routine used in the routine of FIG. 7;

FIG. 9 is a vector flow diagram illustrating source and destination block interpolation and averaging according to the invention;

FIG. 10 is a vector flow diagram illustrating the source block interpolation according to the invention;

FIGS. 11A and 11B are flowchart illustrations of a VERT INTERPOLATION ONLY routine implemented in the routine of FIG. 6;

FIG. 12 is a flowchart illustration of a WIDTH16 [...] routine implemented in the routine of FIGS. 11A and 11B; and

FIG. 13 is a vector flow diagram of the parallel interpolation employed in FIG. 12.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Turning now to the drawings, FIG. 1 shows a block diagram of a computer 100. In FIG. 1, a central processing unit (CPU) 110 provides processing power for the computer system 100. The CPU 110 is preferably an Intel Pentium-Pro® processor with a multimedia extension unit (MEU), as shown in FIG. 2. However, a number of other microprocessors suitably equipped with an MEU may be used, including a PowerPC microprocessor, an R4000
microprocessor, a Sparc microprocessor, or an Alpha microprocessor, among others. The CPU 110 is connected to a read only memory (ROM) 112. The ROM 112 provides boot code such as a system BIOS software that boots up the CPU 110 and executes a power-on self test (POST) on the computer system 100.

In addition, the CPU 110 is connected to a random access memory (RAM) 114. The RAM 114 allows the CPU 110 to buffer instructions as well as data in its buffer while the computer 100 is in operation. The RAM 114 is preferably a dynamic RAM array with 32 megabytes of memory. The CPU 110 is also connected to a real time clock and timer 116. The real time clock and timer 116 stores the date and time information for the CPU 110. Furthermore, the real time clock and timer 116 has a lithium backup battery to maintain the time information even when the computer system 100 is turned off.

The CPU 110 is also connected to a disk storage device 118. The disk storage device 118 stores executable code as well as data to be provided to the CPU 110. Additionally, the CPU 110 is connected to a CD-ROM player 119. Typically, an IBM PC compatible computer controls the disk drive 118 and the CD-ROM player 119 via an Intelligent Drive Electronics (IDE) interface.

Additionally, the CPU 110 is connected to a camera 120. The camera 120 supports video conferencing between the user and other users. The camera 120 essentially consists of a lens, a charge-coupled-device (CCD) array, and an analog to digital converter. The lens focuses light onto the CCD array, which generates voltages proportional to the light. The analog voltages generated by the CCD array are converted into a digital form by the analog to digital converter for processing by the CPU 110.

The CPU 110 is also connected to a video card 122. On the back of the video card 122 are one or more jacks. Connectors for monitors can be plugged into the jacks. The connectors, which are adapted to be plugged into the jacks of the video card 122, eventually are connected to the input of a video monitor 124 for display.

A pen-based user interface is also provided. A digitizer 126 is connected to the CPU 110 and is adapted to capture user input. Additionally, a pen 128 is provided to allow the user to operate the computer. The pen 128 and digitizer 126 in combination support another mode of data entry in addition to a keyboard 132.

The video monitor 124 receives output video signals from the CPU 110 and displays these signals to the user. The keyboard 132 is connected to a keyboard controller 130 and provides input information to the CPU 110. Additionally, one or more serial input/output (I/O) ports 134 are provided in the computer system 100. Connected to the serial I/O ports 134 are a plurality of peripherals, including a mouse 140 and a facsimile modem 136. The facsimile modem 136 in turn is connected to a telephone unit 138 for connection to an Internet service provider, for example. Preferably, the modem 136 is a 28.8 kilobits per second modem (or greater) that converts information from the computer into analog signals transmitted by ordinary phone lines or plain old telephone service (POTS). Alternatively, the modem 136 could connect via an integrated service digital network (ISDN) line to transfer data at higher speeds.

Furthermore, a parallel input/output (I/O) port 142 is provided to link to other peripherals. Connected to the parallel I/O port 142 is a laser printer 144. Additionally, a microphone 148 is connected to a sound board 146 which eventually provides input to the CPU 110 for immediate processing or to a disk drive 118 for offline storage. The sound board 146 also drives a music quality speaker 150 to support the multimedia-based software. As multimedia programs use several media, the multimedia computer system of the present invention integrates the hardware of the computer system 100 of the present invention. For example, the sound board 146 is used for sound, the monitor 124 is used to display movies and the CD-ROM player 119 is used for audio or video. In this manner, sounds, animations, and video clips are coordinated to make the computer session more friendly, usable and interesting.

Turning now to FIG. 2, a functional block diagram of the processor microarchitecture employed by the present invention is shown. The processor of the present invention is preferably based on an Intel-compatible Pentium-Pro microprocessor. The mode employed by the present invention is in addition to the existing modes of the 486 and Pentium processors, and unless otherwise indicated, the operation and features of the processors remain unchanged. Familiarity with the operation of the 486, Pentium and Pentium Pro is assumed in this description. For additional details, reference should be made to the appropriate data book. However, the invention could also be used in earlier processor generations such as the Intel Pentium™, 80486™, 80386™, 80286™, and 8086™ microprocessors. The features of the multimedia extension unit could also be used with other types of microprocessors, including without limitation, the Power PC architecture, the Sparc architecture, and the MIPS R4000 architecture. For purposes of this disclosure, the terms microprocessor and processor can be used interchangeably.

In FIG. 2, the processor P employed by the present invention interacts with the system bus and the Level 2 cache (not shown) via a bus interface unit 300. The bus interface unit 300 accesses system memory through the system bus. Preferably, the bus interface unit 300 is a transaction oriented 64-bit bus such that each bus access handles a separate request and response operation. Thus, while the bus interface unit 300 is waiting for a response to one bus request, it can issue additional requests. The interaction with the Level 2 cache via the bus interface unit 300 is also transaction oriented. The bus interface unit 300 is connected to a combination instruction fetch unit and a Level 1 instruction cache 302. The instruction fetch unit of the combination unit 302 fetches a 32-byte cache line per clock from the instruction cache in the combination unit 302. The combination unit 302 is also connected to an instruction pointer unit and branch target buffer combination 304. The branch target buffer in turn receives exception/interrupt status and branch misprediction indications from an integer execution unit 324, as discussed below.

Additionally, the instruction fetch unit/L1 cache combination 302 is connected to an instruction decoder 306. The instruction decoder 306 contains one or more simple decoders 308 and one or more complex decoders 310. Each of decoders 308 and 310 converts an instruction into one or more micro-operations ("micro-ops"). Micro-operations are primitive instructions that are executed by the processor's execution unit. Each of the micro-operations contains two logical sources and one logical destination per micro-operation.

The processor P has a plurality of general purpose internal registers which are used for actual computation, which can be either integer or floating point in nature. To allocate the internal registers, the queued micro-ops from the instruction decoder 306 are sent to a register alias table unit 312 where references to the logical register of the processor P are
converted into internal physical register references. Subsequently, allocators in the register alias table unit 312 add status bits and flags to the micro-ops to prepare them for out of order execution and send the resulting micro-ops to an instruction pool 314.

The instruction pool 314 is also connected to a reservation station 318. The reservation station 318 also receives the output of the register alias table 312. The reservation station 318 handles the scheduling and dispatching of micro-ops from the instruction pool 314. The reservation station 318 supports classic out-of-order execution where micro-ops are dispatched to the execution unit strictly according to data flow constraints and execution resource availability to optimize performance.

The reservation station 318 is in turn connected to a plurality of execution units, including a multimedia extension unit (MEU) 320, a floating point unit (FPU) 322, an integer unit (IU) 324, and a memory interface unit (MIU) 326. The MEU 320, FPU 322, IU 324 and MIU 326 are in turn connected to an internal data-results bus 330. The internal data-results bus 330 is also connected to the instruction pool 314, a Level 1 data cache 332 and a memory reorder buffer 334. Furthermore, the Level 1 data cache 332 and the memory reorder buffer 334 are connected to the bus interface unit 300 for receiving multiple memory requests via the transaction oriented bus interface unit 300. The memory reorder buffer 334 functions as a scheduling and dispatch station to track all memory requests and is able to reorder some requests to prevent data blockage and to improve throughput.

Turning now to the execution units, the memory interface unit 326 handles load and store micro-ops. Preferably, the memory interface unit 326 has two ports, allowing it to process the address on a data micro-op in parallel. In this manner, both a load and a store can be performed in one clock cycle. The integer unit 324 is an arithmetic logic unit (ALU) with an ability to detect branch mispredictions. The floating point execution units 322 are similar to those found in the Pentium processor. From an abstract architectural view, the FPU 322 is a coprocessor that operates in parallel with the integer unit 324. The FPU 322 receives its instruction from the same instruction decoder and sequencer as the integer unit 324 and shares the system bus with the integer unit 324. Other than these connections, the integer unit 324 and the floating point unit 322 operate independently and in parallel.

In the preferred embodiment, the FPU 322 data registers consist of eight 80-bit registers. Values are stored in these registers in the extended real format. The FPU 322 instructions treat the eight FPU 322 data registers as a register stack. All addressing of the data registers is relative to the register on top of the stack. The register number of the current top of stack register is stored in the top. Load operations decrement the top by one and load a value into the new top of stack register, and store operations store the value from the current top register in memory and then increment top by one. Thus, for the FPU 322, a load operation is equivalent to a push and a store operation is equivalent to a pop in the conventional stack.

Referring now to the multimedia extension unit (MEU) 320, the MEU 320 enhances the instruction set to include vector instructions, partitioned instructions operating on small data elements, saturating arithmetic, fixed binary point data, data scaling support, multimedia oriented ALU functions, and flexible operand routing. To preserve compatibility and minimize the hardware/software impact, the MEU 320 uses the same registers as the FPU 322. When new multimedia instructions are executed on the MEU 320, the registers of the FPU 322 are accessed in pairs. As the FPU 322 registers each have 80 bits of data, the pairing of the FPU 322 registers effectively creates four 160-bit wide registers, as further discussed below. Furthermore, the MEU 320 adds newly defined instructions which treat registers as vectors of small fixed point data values rather than large floating point numbers. Since the operating system saves the entire state of the FPU 322 as necessary during context switches, the operating system need not be aware of the new functionality provided by the MEU 320 of the present invention. Although the disclosed system contemplates that the MEU 320 and the FPU 322 share logic or registers, the processor P could simply have snooping logic that maintains coherency between register values in completely separate MEU 320 and FPU 322 sections.

With respect to status and control bits, the FPU 322 has three registers for status and control: status word, control word, and tag word. These FPU 322 registers contain bits for exception flags, exception masks, condition codes, precision control, rounding control and stack packs. The MEU 320 does not use or modify any of these bits except for the stack pack bits, which are modified because the MEU 320 result values are often not valid floating point numbers. Thus, anytime a MEU instruction is executed, the entire FPU tag word is set to 0xfffh, marking all FPU 322 registers as empty. In addition, the top of stack pointer in the FPU 322 status words (bits 11-13) is set to 0 to indicate an empty stack. Thus, any MEU 320 instruction effectively destroys any floating point values that may have been in the FPU 322. As the operating system saves and restores the complete FPU state for each task, the destruction of floating point values in the FPU 322 is not a problem between tasks. However, appropriate software action may need to be taken within a single task to prevent errors arising from modifications to the FPU 322 registers.

The sharing of the registers of the FPU 322 and the MEU 320 avoids adding any new software visible context, as the MEU 320 does not define any new processor status, control or condition code bits other than a global MEU extension enable bit. Furthermore, the MEU 320 can execute concurrently with existing instructions on the registers of the integer unit 324. Therefore, the CPU 110 logic is well utilized as the MEU 320 is efficiently dedicated to signal processing applications while the FPU 322 is dedicated to floating point intensive applications and the integer unit 324 handles addressing calculations and program flow control. Additionally, the MEU 320 allows for scalability and modularity, as the MEU 320 does not change the integer or load/store units. Thereby, the CPU 110 core design is not impacted when the MEU 320 is included or excluded from the processor P.

Referring now to FIG. 3, a more detailed block diagram of the MEU 320 is shown. The MEU 320 contains a vector arithmetic logic unit (VALU) 342. The VALU 342 is in turn connected to a plurality of vector registers 344, preferably four. These vector registers are preferably the same registers as those present in the FPU 322.

In the MEU 320, the FPU registers 344 are accessed in pairs. As each of the FPU 322 registers is 80 bits in width, the pairing of the FPU 322 registers effectively creates four 160-bit wide vector registers 344. Thus, as shown in FIG. 3, the register pairs of the FPU 322 are referred to as V0, V1, V2 and V3 and correspond to the physical FPU 322 registers. For instance, FPU 322 physical register 0 is the same as the lower half of the MEU 320 vector register V0.
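The register pairing can be illustrated with a small Python sketch. The function name is hypothetical, and the rule that physical register n occupies half (n mod 2) of vector register V(n div 2) is an assumption generalized from the examples the text gives (registers 0, 1 and 7); the text does not spell out the intermediate registers.

```python
def fpu_register_to_vector_half(n):
    """Map an FPU physical register number (0-7) to (vector register, half).

    Assumption: even registers form the lower 80 bits and odd registers
    the upper 80 bits of each 160-bit vector register.
    """
    if not 0 <= n <= 7:
        raise ValueError("FPU physical register number must be 0..7")
    half = "lower" if n % 2 == 0 else "upper"
    return f"V{n // 2}", half


for n in (0, 1, 7):
    print(n, fpu_register_to_vector_half(n))
```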
Similarly, FPU 322 physical register 1 is the same as the upper half of MEU 320 vector register V0, while the FPU 322 physical register 7 is the same as the upper half of the MEU 320 vector register V3. Furthermore, in the MEU 320 of FIG. 3, the stack based access model of the 80x87 floating point instructions is not utilized. Instead, the 160-bit registers V0-V3 are partitioned to form vectors of 10-bit or 20-bit data elements.

The output of the vector registers 344 is subsequently provided to an operand router unit (ORU) 346 and the VALU 342. Each vector instruction controls both the ORU 346 and the VALU 342. In combination, the ORU 346 and the VALU 342 allow the processor P to simultaneously execute software using flexible operand routing and multiple operation. Referring to the flow graph of FIG. 13 for example, the VALU 342 operates on the nodes and the ORU 346 implements diagonal interconnections. Thus, because vector arithmetic of different types and data movement can be processed in groups simultaneously, the VALU 342 and the ORU 346 provide high performance.

The VALU 342 can perform a variety of operations, including addition, subtraction, multiply, multiply/accumulate, shifting and logical functions. The VALU 342 assumes that each of the 160-bit registers 344 is partitioned into 10-bit or 20-bit source operands and destinations. Thus, the VALU 342 can execute 8 or 16 individual operations per instruction. A three-operand instruction format is supported by the VALU 342: source A, source B, and destination registers for each instruction. Additionally, certain operations, such as multiply/accumulate, use the destination as an implied third source operand.

The MEU 320 operates primarily in fixed point operation. The difference between fixed point and integer data is the location of the binary point. In the MEU 320, the binary point is assumed to be to the left of the most significant bit. Numbers in the MEU 320 can be considered as fractions that nominally occupy the range from plus 1 to minus 1. The advantage of this format over the integer format is that the numerical magnitude of the data does not grow with each multiply operation, as the product of two numbers in the plus 1 to minus 1 range yields another number in the plus 1 to minus 1 range. Therefore, it is less likely the data will need to be rescaled.

The MEU 320 takes advantage of the full 80-bit width of the FPU 322 register set. The MEU 320 loads data from memory in 8-bit or 16-bit quantities, but the data is expanded to 10 bits or 20 bits as it is placed into the vector registers 344 (V0 . . . V3). The extended provision provides two benefits: (1) simplifying support for signed and unsigned data; and (2) helping to avoid overflow conditions and round-off errors on intermediate results.

Furthermore, the VALU 342 performs all arithmetic operations using saturating arithmetic. Saturating arithmetic differs from the more familiar modular arithmetic when overflows occur. In modular arithmetic, a positive value that is too large to fit into the destination wraps around and becomes very small in value. However, in saturating arithmetic, the maximum representable positive value is substituted for the oversized positive value. This operation is often called clipping.

Additionally, the VALU 342 performs adds, subtracts and Boolean operations on 10-bit to 20-bit quantities. If the result of an add or subtract is outside of the representable range, the result is clipped to the largest positive or negative representable value. However, Boolean operations are not clipped. Furthermore, the result of the add, subtract, and move operations may optionally be shifted right by one bit before being stored to the destination. This scaling can be used to compensate for the tendency of data magnitude to grow with each add or subtract operation. Multiply operations [...] of the product are rounded and dropped before being stored into the 10-bit or 20-bit destination register. As simple multiply operations typically do not overflow, they do not need to be clipped. However, multiply/accumulate operations do require clipping.

Turning now to FIG. 4, the details of the operand routing unit 346 are shown. The ORU 346 allows operands to be flexibly moved within and between large 160-bit registers. As vector processors generally must load data from memory in large monolithic chunks, the ability to route operands is useful for the MEU 320. With the ability to flexibly access and route individual operands, the ORU 346 can "swizzle" the data partitions in a vector register as data moves through it. The swizzling operation allows the operands to be shuffled as needed by the application concurrently with the execution of the vector ALU operations. Thus, a smaller amount of data is required to yield useful results, and the load and store units are less likely to be overloaded, leaving greater bandwidth for the integer, non-vector units to perform work.

As shown in FIG. 4, the ORU 346 is essentially an enhanced 8x8 crossbar switch which works with a plurality of slots. In the preferred embodiment, eight slots are provided for each of a source B register 350, a source A register 354 and a destination register 358. The source B register 350 is connected to a multiplexer 352. The outputs of the multiplexer 352 and the source A register 354 are provided to a VALU partition 356. The VALU partition 356 in turn is connected to the destination register 358.

In the vector source B register 350, each slot contains either one 20-bit partition or two 10-bit partitions, depending on the partition width as specified in the vector instruction. For 10-bit partitions, the MEU 320 simultaneously performs independent but identical operations on the two partitions in a slot. Furthermore, each slot in the destination register 358 can independently receive one of eleven values: the value in one of the eight source slots 350 and 354, a Z value, a P value or an N value.

During the execution of codes by the MEU 320, all vector instructions use a single opcode format that simultaneously controls the VALU 342 and ORU 346. This format is approximately eight bytes long. Each instruction encodes the two source registers, the destination register, the partition size, and the operations to be performed on each partition. In addition, each instruction encodes the ORU 346 routing settings for each of the eight slots. Normally, any two of the vector operations defined in the following table may be specified in a single vector instruction. Each slot can be arbitrarily assigned either of the two operations. The vector instructions offered by the MEU 320 are shown in Tables 1 and 2, as follows:
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TABLE 1 



Vector Operation Descriptions 



Category 



Mnemonic 



Description 



Add 



Subtract 



Accumulate 
/Merge 

Negate 

Distance 

Multiply 



Conditional 
Move 



Scale 



Logical 
Shift 

Boolean 



add 



sub 
sbr 



add_ 



sub_ 
sbr_ 



Round 



Magnitude 
Check 



SourceA 
Partition 
Shift 



Slot 
Routing 



Store 

Conversion 



Extended- 
Precision 



neg 
dist 
mul 
mac 

mvz mvnz 
mvgez mvlz 

asr n 
asl n 

lsr n 
lsl n 

false nor 
bnota 
nota anotb 
notb 

xor nand and 

uxor b 

borna 

a aornb or 

true 

rnd n 



mag 



pshra 



bibb 
ahbh 
albl 



ws2u 



emach 
einacl 
cmaci 
carry 



Add source A and sourceB partitions, place sum in destination. 
add_ arithmetically shifts the result right by one bit (computes 
average). 

Subtract partitions, sub does sourceA * source B; sbr does source B - 
source A. sub_ and sbr_ arithmetically shift the result right by 
one bit. 

Add the contents of the destination register partition to the sourceB 

partition and place the sum in the destination. acum_ arithmetically 

shift the result right by one bit. 

Negate sourceB partition and place in destination. 

Subtract partitions then perform absolute value. 

mul multiplies the sourceA partition by the sourceB partition and 

places the product in the destination, mac multiplies sourceA by 

source B and adds the product to the destination. 

Conditionally move partition in sourceB register to partition in 

destination register depending on sourceA partition's relationship to 

zero. 

Arithmetically shifts the operand in sourceB by amount n. N can 
be between 1 and 4 inclusive, asl uses saturating arithmetic and 
shifts zeros in from the right, asr copies trie sign bit from the left. 
Logically shifts the operand in sourceB by amount n. N can be 
between 1 and 4 inclusive. Zeros are shifted in from the left or 
right, lsl uses m odulo arithmetic; it does not clip. 
Perform one of sixteen possible Boolean operations between 
sourceA and sourceB partitions. (The operations are listed in order 
of their canonical truth table representations.) 



Add the constant (1*LSb << n-1) to sourceB, then zero out the n
lowest bits. n can be between 1 and 4 inclusive. Implements
"round-to-even" method: If (sourceB<n:0> == 010...0), then don't
do the add.

This operation can be used to implement block floating point
algorithms. If the number in sourceB has fewer consecutive leading
1's or 0's than the number in sourceA, then sourceB is placed in the
destination; otherwise sourceA is placed in the destination. Only
the eight leftmost bits of the values are used in the comparison; if
both sourceA and sourceB start with a run of more than 7 bits, then
the result is the value from sourceA. This operation is an
approximation of the "C" statement: (abs(sourceA) <=
abs(sourceB)) ? sourceA : sourceB.

For each slot s, copy the contents of slot s + 1 from the sourceA 
register to slot s in the destination register. (If this operation is 
used in slot 7, then the result is immediate zero). This operation 
can be used to efficiently shift data inputs and outputs during 
convolutions (FIR filters, etc.). 

These operations are defined only for 20-bit partitions. They are
used to route 10-bit data across the even/odd "boundary" that the
ORU doesn't cross. blbh swaps the upper and lower halves of the
sourceB operand and places the result in the destination. ahbh
concatenates the upper half of sourceA with the upper half of
sourceB. albl concatenates the lower half of sourceA with the lower
half of sourceB.

This operation is used prior to storing 16-bit unsigned data from a
20-bit partition. If bit 19 of sourceB is set, the destination is set to
zero. Otherwise, this operation is the same as lsl 1.

These operations are used to perform multiply-and-accumulate
functions while retaining 36 bits of precision in intermediate results;
they are only defined for 20-bit partitions. emach is the same as
mac, except that no rounding is done on the LSb. emacl multiplies
sourceA and sourceB, then adds bits <18:3> of the 39-bit
intermediate product to bits <15:0> of the destination,
propagating carries through bit 19 of the destination. emaci is
similar to emacl, except that bits <19:16> of the destination are
cleared prior to the summation. The carry operation logically shifts
sourceB right by 16 bits, then adds the result to sourceA.
TABLE 2

Operation Synonyms

Category                Alias Name  Actual Operation  Description
----------------------  ----------  ----------------  ------------------------------------
Move SourceB            mov         b                 Move the sourceB register partition
                                                      to the destination partition.
                        mov_        asr 1             mov_ arithmetically shifts the
                                                      result right by one bit.
Move SourceA            mova        a                 Copy the partition in sourceA to
                                                      the destination.
SourceA Absolute Value  absa        dist (. . Z . .)  Compute the absolute value of the
                                                      sourceA partition.
Unmodified Destination  dest        acum (. . Z . .)  Leave the destination partition
                                                      unchanged.
Average                 avg         add_              Compute average of two values.

TABLE 3-continued

Load and Store Instruction Descriptions

Instruction Type       Mnemonic Format   Description
---------------------  ----------------  ------------------------------------
16-Byte, 10-Bit Store                    of unsigned 8-bit data at address
(continued)                              mem128. Data is stored using a 2:1
                                         interleave pattern.
8-Byte, 10-Bit Store   vstb mem64, vsh   Store source register vs to 8 bytes
                                         of unsigned 8-bit data at address
                                         mem64. The upper half of each slot is
                                         stored to memory; the lower half of
                                         each slot is ignored.

Turning now to load and store instructions, each type of
operation has two versions: one that moves 16 bytes of
memory and one that moves 8 bytes of memory. The 8-byte
versions are defined because this is often the amount of data
needed; loading or storing 16 bytes in these cases would be
wasteful. Further, the 8-byte loads and stores can be used to
convert between byte-precision data and word-precision
data. The 16-byte loads and stores operate on the entire
160-bit vector register. The 8-byte stores for 20-bit partitions
store only the values from slots 4 through 7. The 8-byte
stores for 10-bit partitions store only the upper half of each
of the eight slots. The 8-byte loads for 20-bit partitions load
the memory data to slots 4 through 7; slots 0 through 3 are
set to zero. The 8-byte loads for 10-bit partitions load the
memory data to the upper half of each slot; the lower half of
each slot is set to zero. Even though 8-byte loads only copy
memory to half of the bits in a vector register, the entire
160-bit vector register is updated by padding the unused
partitions with zeros. This feature greatly simplifies the
implementation of register renaming for the MEU because
partial register updates do not occur. Table 3 illustrates the
load and store instructions in more detail:



TABLE 3

Load and Store Instruction Descriptions

Instruction Type       Mnemonic Format   Description
---------------------  ----------------  ------------------------------------
16-Byte, 20-Bit Load   vldw vd, mem128   Load destination register vd with
                                         16 bytes of signed 16-bit data at
                                         address mem128.
8-Byte, 20-Bit Load    vldw vdh, mem64   Load slots 4 through 7 of destination
                                         register vd with 8 bytes of signed
                                         16-bit data at address mem64. Set
                                         slots 0 through 3 of vd to zero.
16-Byte, 10-Bit Load   vldb vd, mem128   Load destination register vd with
                                         16 bytes of unsigned 8-bit data at
                                         address mem128. Data is loaded
                                         using a 2:1 byte interleave pattern.
8-Byte, 10-Bit Load    vldb vdh, mem64   Load destination register vd with
                                         8 bytes of unsigned 8-bit data at
                                         address mem64. The upper half of
                                         each slot receives the memory values;
                                         the lower half of each slot is set to
                                         zero.
16-Byte, 20-Bit Store  vstw mem128, vs   Store source register vs to 16 bytes
                                         of signed 16-bit data at address
                                         mem128.
8-Byte, 20-Bit Store   vstw mem64, vsh   Store slots 4 through 7 of source
                                         register vs to 8 bytes of signed 16-bit
                                         data at address mem64.
16-Byte, 10-Bit Store  vstb mem128, vs   Store source register vs to 16 bytes
                                         (continued)





The mnemonics for the vector instruction need to specify
the operations to perform on each partition as well as the
sources, destination and ORU routing. This is notated as
follows:

{sbr sbr add add sbr add sbr add} word V3, V2, V1(37P3Z1N2)

This instruction performs adds and reverse subtracts. V3
is the destination; V2 is sourceA; V1 is sourceB. The slots
for the operand specifier and the routing specifier are laid out
in decreasing order from left to right; slots 7 and 6 get sbr,
slot 5 gets add, and so forth. The "word" symbol specifies
that the instruction works on 20-bit partitions. The routing
specifier for sourceB is set for the following (the numbers
after the points specify slot numbers):

dest.7 <= -sourceA.7 + sourceB.3

dest.6 <= -sourceA.6 + sourceB.7

dest.5 <= sourceA.5 + #1.0

dest.4 <= sourceA.4 + sourceB.3

dest.3 <= -sourceA.3 + #0.0

dest.2 <= sourceA.2 + sourceB.1

dest.1 <= -sourceA.1 + #-1.0

dest.0 <= sourceA.0 + sourceB.2
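The slot-by-slot behavior of this example instruction can be
sketched in Python. This is a minimal model under stated
assumptions, not the MEU's actual implementation: the names
route and execute are hypothetical, the routing letters P, Z,
and N are assumed to select the constants +1, 0, and -1,
registers are modeled as lists indexed by slot number, and
partition widths and saturation are ignored.

```python
# Hypothetical model of one vector instruction with ORU routing.
# Assumption: P, Z, N in the routing specifier select +1, 0, -1.
ROUTE_CONSTANTS = {"P": 1, "Z": 0, "N": -1}

# Per-slot operations as described in the operation table.
OPS = {
    "add": lambda a, b: a + b,  # sourceA + sourceB
    "sub": lambda a, b: a - b,  # sourceA - sourceB
    "sbr": lambda a, b: b - a,  # sourceB - sourceA (reverse subtract)
}


def route(spec, src_b):
    """Return the routed sourceB values, listed from slot 7 down to slot 0."""
    return [ROUTE_CONSTANTS[c] if c in ROUTE_CONSTANTS else src_b[int(c)]
            for c in spec]


def execute(ops, src_a, src_b, spec):
    """Apply one operation per slot; ops and spec are written slot 7 first."""
    routed = route(spec, src_b)
    dest = [0] * 8
    for i, (op, b) in enumerate(zip(ops, routed)):
        slot = 7 - i
        dest[slot] = OPS[op](src_a[slot], b)
    return dest


ops = ["sbr", "sbr", "add", "add", "sbr", "add", "sbr", "add"]
v2 = [10, 11, 12, 13, 14, 15, 16, 17]          # sourceA, index = slot
v1 = [100, 101, 102, 103, 104, 105, 106, 107]  # sourceB, index = slot
v3 = execute(ops, v2, v1, "37P3Z1N2")
```

Here v3[7] is -v2[7] + v1[3] and v3[5] is v2[5] + 1, matching
the slot assignments listed above.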

Before turning to the details of the multimedia extension 
unit implemented motion compensation routine for recon- 
structing motion compensated frames using MPEG 
decoding, a brief understanding of the basics of such recon- 
struction is helpful. The motion compensation technology 
for MPEG is generally known to the art, for example, 
through the MPEG Software Simulation Group, which has 
provided a number of publicly available programs for imple- 
menting this technology. 

Referring to FIG. 5, shown is a frame 400 for illustrating 
the motion compensation implemented on a multimedia 
extension unit according to the invention. First, in MPEG, 
the standard display element is a "pel," which is composed 
of three component sample values — a luminance component 
and two chrominance components. A basic unit in the frame 
400 is a "macro block," which is composed of a 16x16 array 
of luminance components, and two 8x8 arrays of chromi- 
nance components. Thus, there are four luminance compo- 
nents for each chrominance component. A frame, such as the 
frame 400, is generally stored sequentially in memory. A 
frame can be interlaced with two fields making up the frame. 
The elements beginning at the top lefthand corner at (0,0) are
stored sequentially to the horizontal end of the frame (LX,0).
The next row is then sequentially stored from (0,1) to (LX,1)
and so on until the entire frame has been stored. Generally,
the macroblocks are stored separately for the luminance
values and chrominance values. In the discussion to follow,
a routine MEURECON_COMP 500 (see FIG. 6) is called
separately for the luminance values and the chrominance
values, with appropriate settings (8x8 versus 16x16 element
blocks) set for each.

Further, referring to FIG. 5, illustrated are a number of 
variables used in the calls to MEURECON_COMP 500. In
reference to that routine, it is assumed both a source and
destination buffer are present in memory. In the source 
buffer, a source block 402 is shown, and in the destination 
buffer, a destination block 404 is shown. Variables X and Y 
designate the current location in the destination buffer that is 
the start of the current macroblock. The width of the current 
macro block is W, and its height is H. Variables DX and DY 
designate the motion vector, and the starting location of the 
source macroblock that will be used for motion compensa- 
tion is (X+DXINT, Y+DYINT) within the source buffer. 
Other variables associated with MPEG are used to handle 
interlaced video and to address the type of motion compen- 
sation that is being used for a particular motion vector. These 
will be understood by those skilled in the art of MPEG 
design. As a further note, DX and DY, as illustrated in FIG. 
5, are actually the integer portions of X and Y of the motion 
vector. This is because MPEG provides for one-half bit 
resolution compensation. That is, the source macroblock for 
motion compensation can be designated to be between the 
actual pels within the source buffer. This, too, will be 
appreciated by those skilled in the art of MPEG decoding. 

Motion compensation in MPEG can be forward only or 
forward and backward in time. If forward and backward, 
typically the forward and backward values are averaged. The 
MEURECON_COMP routine 500 is general enough so that 
when called by a general MPEG reconstruction routine, it 
can perform these different compensation techniques. 

With this background in mind, FIG. 6 shows the technology
implemented according to the invention on a multimedia
extension unit. FIG. 6 is a flowchart illustration of the
MEURECON_COMP routine 500, which corresponds to
the MEURECON_COMP routine in the attached source
code appendix. The steps illustrated by the blocks of the
flowcharts that follow are correspondingly labeled in the
source code appendix. At step 502, the integer and fractional
portions of the motion vector are separated. This allows the
integer portion to be used to establish the appropriate
source macroblock, while the fractional portions will be
used to determine if interpolation in that source according to
the invention will be needed. Specifically, DXINT and
DYINT designate the integral portion of the motion vector,
while DXFRAC and DYFRAC designate the one-bit
fractional portions.
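The separation at step 502 can be sketched as follows. This is
an illustrative sketch only: it assumes that DX and DY arrive in
half-element units, so that the low bit is the one-bit fraction
and an arithmetic right shift gives the integer part. The
function name split_motion_vector is hypothetical and does not
come from the source code appendix.

```python
def split_motion_vector(dx, dy):
    """Split a motion vector into integer and one-bit fractional parts.

    Assumption: dx and dy are expressed in half-element units.
    """
    dxint, dxfrac = dx >> 1, dx & 1  # arithmetic shift keeps the sign
    dyint, dyfrac = dy >> 1, dy & 1
    return dxint, dyint, dxfrac, dyfrac


# A vector of (+2.5, -1.5) elements is (5, -3) in half-element units.
dxint, dyint, dxfrac, dyfrac = split_motion_vector(5, -3)
```

With these inputs the integer parts are 2 and -2 and both
fractional bits are set, since -2 + 0.5 equals -1.5.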

Turning to step 504, the pointers to the locations within
the source and destination buffers of the macroblocks that
will be used are calculated. First, a source macroblock
pointer S is calculated to be equal to SRC + LX*(Y+DYINT) +
X + DXINT. It will be understood that this is an appropriate
offset for the appropriate row (LX*(Y+DYINT)) plus the
offset within that row (X+DXINT). Then, a destination
macroblock pointer D is calculated to be equal to DST +
LX*Y + X.
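The pointer arithmetic of step 504 can be checked with a small
sketch. The buffers are modeled as flat row-major arrays, so the
"pointers" are simply offsets from SRC and DST; the function name
block_offsets and the sample frame width are illustrative
assumptions.

```python
def block_offsets(lx, x, y, dxint, dyint):
    """Return (S - SRC, D - DST) for a frame that is lx elements wide."""
    s = lx * (y + dyint) + x + dxint  # row offset plus offset within the row
    d = lx * y + x
    return s, d


# Hypothetical example: 720-wide frame, block at (32, 16), vector (2, -2).
s_off, d_off = block_offsets(720, 32, 16, 2, -2)
```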

The fractional portions of the motion vector, DXFRAC and
DYFRAC, are then analyzed to determine if source interpolation
will be used in either the x or y direction. If the
fractional portion of DX is non-zero, then horizontal
interpolation will be needed. If the fractional portion of DY is
non-zero, vertical interpolation will be needed. Based on
these comparisons, one of four routines is executed. These
routines appropriately move, interpolate, and average the
source and destination macroblocks. The first of these is a
NO_INTERPOLATION routine 506, the next is a
VERT_INTERPOLATION_ONLY routine 508. The next is a
HORIZ_INTERPOLATION_ONLY routine 510, and
finally a HORIZ_AND_VERT_INTERPOLATION routine 512.
These are discussed below in conjunction with
FIGS. 7 and 11A and 11B. Specifically, the vector manipulation
is illustrated in detail for the
HORIZ_INTERPOLATION_ONLY routine 510, and further details
are illustrated in conjunction with the
VERT_INTERPOLATION_ONLY routine 508. The
NO_INTERPOLATION and HORIZ_AND_VERT_INTERPOLATION
routines 506 and 512 will be understood
from the detailed descriptions of the other two routines.
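The four-way selection on the fractional bits can be sketched
as below. The routine names are those used in the text; the
function select_routine and its string return values are
illustrative assumptions, since the actual dispatch is in the
source code appendix.

```python
def select_routine(dxfrac, dyfrac):
    """Pick one of the four routines from the one-bit fractions."""
    if not dxfrac and not dyfrac:
        return "NO_INTERPOLATION"            # routine 506
    if not dxfrac:
        return "VERT_INTERPOLATION_ONLY"     # routine 508
    if not dyfrac:
        return "HORIZ_INTERPOLATION_ONLY"    # routine 510
    return "HORIZ_AND_VERT_INTERPOLATION"    # routine 512
```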

After the macroblock motion compensation is complete,
the routine exits at step 514, returning to calling MPEG
routines, which proceed with further motion compensation
and frame reconstruction until the frame is entirely
constructed, at which point it is displayed on a video screen.

Turning to FIG. 7, details of the
HORIZ_INTERPOLATION_ONLY routine 510 are shown. Beginning
at step 600, a variable E is set equal to the height of the
current macroblock, which would be 8 for a chrominance
block or 16 for a luminance block. Proceeding to step 602,
it is determined whether the reconstruction is to be done by
averaging the source macroblock with the destination
(ADDFLAG true), or simply a straight copy.
Source/destination averaging is desirable in a variety of MPEG
techniques, such as dual prime prediction, as one skilled in
MPEG implementations will appreciate. If the source
macroblock is to be averaged with the destination, control
proceeds to step 604, where it is determined whether the
width is 16 elements (corresponding to a luminance block)
or 8 elements (corresponding to a chrominance block). If a
16-element block, control proceeds to a
WIDTH16_HORIZINT1_LOOP routine 606, which performs
horizontal interpolation on a 16-element square block.
Else, control proceeds to a WIDTH8_HORIZINT1_LOOP
routine 608, which performs 8-element square interpolation.

If at step 602 ADDFLAG is false, indicating that the
source block is to be copied to the destination block, control
proceeds to step 610, where it is determined whether this is
to be a 16-element or an 8-element square block copy. If W
equals 16, indicating a luminance block, control proceeds to
a WIDTH16_HORIZINT2_LOOP routine 612; if an
8-element square chrominance block, control proceeds to a
WIDTH8_HORIZINT2_LOOP routine 614. From steps
606, 608, 612, or 614, control returns to the calling MPEG
routine at step 616.

Turning to the WIDTH16_HORIZINT1_LOOP routine
606, this is illustrated in FIG. 8. First, parallel vector
operations are performed in a step 650, which interpolate
adjacent pixels from the source block and average them
with the destination block, yielding a first row in the
destination block. This is further discussed below in
conjunction with FIG. 9. Proceeding to step 652, both the
pointers S and D into the source and destination buffers are
incremented by the frame width. This points to the next row
of elements in the next row of the frame, that is, into the
next row of the block. Proceeding to step 654, E is
decremented, and at step 656 if E is not equal to zero, control
loops to the beginning of the WIDTH16_HORIZINT1_LOOP
routine 606 to interpolate the next row of the source
block into the destination block. Else the routine is done at
step 658, so control returns to FIG. 7.
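The loop control of FIG. 8 can be sketched independently of the
vector work done in step 650. In this sketch the per-row work is
passed in as a callback; the names row_loop and row_op are
hypothetical, and the buffers are assumed to be flat arrays
addressed by the offsets S and D.

```python
def row_loop(row_op, s, d, lx, e):
    """Run row_op once per block row, advancing S and D by the frame width."""
    while True:
        row_op(s, d)   # step 650: interpolate and store one row
        s += lx        # step 652: advance both pointers by the frame width
        d += lx
        e -= 1         # step 654: decrement the row counter
        if e == 0:     # step 656: loop until every row is done
            break
    return s, d        # step 658: done


visited = []
end = row_loop(lambda s, d: visited.append((s, d)), 100, 200, 720, 3)
```

As in the flowchart, the row operation runs before the counter
is tested, so the sketch assumes E is at least 1 on entry.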

Turning to FIG. 9, the details of the interpolation step 650
are illustrated. Beginning at step 700, the vector V0 is loaded
with S[0 . . . 15]. Specifically, V0L is loaded with S[0 . . . 7]
and V0H is loaded with S[8 . . . 15]. Thus, the first row
of the source macroblock, where the source is a luminance
block, is loaded into V0.

Proceeding to step 702, the vector V1 is loaded with the
adjacent elements S[1 . . . 16]. These are the pel components
that will be averaged with their adjacent pel components that
are loaded in V0. This interpolation is done in step 704,
where an add-and-shift instruction is performed. At step 704,
S[0] (in slot 0 of V0L) is simultaneously added to S[1] (slot
0 of V1L) and the result divided by two. Thus, in one step,
the adjacent source pel components of the first row are
averaged and stored in V0. Two instructions are not
needed; one suffices. Proceeding to step 706, vector V1 is
loaded with the destination pixels D[0 . . . 15]. Then, in step
708, the elements in V1 are averaged with the previously
interpolated source pixels held in V0, yielding a result in V0.
Again, the averaging takes place in a single step. After step
708, the results are stored in step 710 into the destination
buffer at D[0 . . . 15].

As discussed in conjunction with FIG. 8, this sequence is 
repeated for each block row within the source and destina- 
tion buffers. It will be appreciated that a single instruction is 
used to interpolate the source pel components, and that in a 
single instruction they are in turn averaged with the desti- 
nation pel components. 
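The sequence of steps 700 through 710 can be modeled in a few
lines. This is a sketch under stated assumptions rather than the
appendix code: vector registers are modeled as Python lists, the
single add-and-shift instruction is modeled by the hypothetical
helper add_shift, and the average is taken as (a + b) >> 1 with
no rounding term, following the description of add_ as an add
followed by a one-bit arithmetic right shift.

```python
def add_shift(va, vb):
    """Model of the single add-and-shift instruction: (a + b) >> 1 per slot."""
    return [(a + b) >> 1 for a, b in zip(va, vb)]


def interp16_add(src, dst, s, d):
    """One row of horizontal interpolation averaged into the destination."""
    v0 = src[s:s + 16]       # step 700: V0 = S[0..15]
    v1 = src[s + 1:s + 17]   # step 702: V1 = S[1..16]
    v0 = add_shift(v0, v1)   # step 704: interpolate adjacent source elements
    v1 = dst[d:d + 16]       # step 706: V1 = D[0..15]
    v0 = add_shift(v0, v1)   # step 708: average with the destination
    dst[d:d + 16] = v0       # step 710: store to D[0..15]


src_row = [10 * i for i in range(17)]
dst_row = [100] * 16
interp16_add(src_row, dst_row, 0, 0)
```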

The WIDTH8_HORIZINT1_LOOP routine 608 is similar
to the WIDTH16_HORIZINT1_LOOP routine 606, but
instead of 16 bytes being loaded and then stored, only 8
bytes are loaded and stored. Referring to FIG. 9, only the
high portion of each vector is used, loading V0H with
S[0 . . . 7]; loading V1H with S[1 . . . 8]; averaging the two;
and then averaging with D[0 . . . 7], contained in V1H. It will
be understood that this is appropriate for 8-element width
source and destination blocks used for chrominance.

Turning to the WIDTH8_HORIZINT2_LOOP routine
614, this routine both illustrates the difference between the
8-element square versus 16-element square processing
discussed above and the difference between copying the
interpolated source to the destination instead of averaging the
source and destination elements. The
WIDTH8_HORIZINT2_LOOP routine 614 is identical to the
WIDTH16_HORIZINT1_LOOP routine 606 of FIG. 8,
except that step 650 is replaced by an INTERP8 series of
instructions 750, which are illustrated in FIG. 10. The
INTERP8 series of instructions only loads V0H with
S[0 . . . 7], as compared to FIG. 9 where S[0 . . . 15] were
loaded into V0.

In FIG. 10, V0H.0 to V0H.7 are loaded with S[0 . . . 7]
at step 752 and V1H is loaded with S[1 . . . 8] in step 754.
These are then simultaneously summed and divided by two, 
(that is, averaged), in step 756, yielding an interpolated 
result, which is then saved in step 758 to the destination 
buffer at D[0 . . . 7]. That is, the effect of the ADDFLAG is 
that when false, as in the routine 614, the source data is 
averaged with adjacent source data and directly saved to the 
destination buffer in the appropriate position. In additive 
routines, such as routines 606 and 608, the source data is 
averaged with the adjacent source data, and then averaged 
into the destination data. 
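For contrast with the additive case, the non-additive INTERP8
sequence of FIG. 10 can be sketched as follows. As before,
registers are modeled as lists and the average as (a + b) >> 1;
the function name interp8_copy is a hypothetical label for steps
752 through 758.

```python
def interp8_copy(src, dst, s, d):
    """Interpolate 8 source elements and store them without averaging."""
    v0h = src[s:s + 8]                               # step 752: S[0..7]
    v1h = src[s + 1:s + 9]                           # step 754: S[1..8]
    v0h = [(a + b) >> 1 for a, b in zip(v0h, v1h)]   # step 756: add and shift
    dst[d:d + 8] = v0h                               # step 758: store to D[0..7]


src_row = [4 * i for i in range(9)]
dst_row = [255] * 8   # prior destination contents are simply overwritten
interp8_copy(src_row, dst_row, 0, 0)
```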

Turning to FIGS. 11A and 11B, flowcharts of the
VERT_INTERPOLATION_ONLY routine 508 are shown. This
routine is in most respects similar to the
HORIZ_INTERPOLATION_ONLY routine 510, except that instead
of horizontally adjacent pel components being averaged,
vertically adjacent pel components are averaged. Further,
the vertical row relationship permits an efficiency of loading
in certain circumstances, specifically when non-interlaced
frames are being employed. Many steps are equivalent to the
corresponding steps discussed in conjunction with FIG. 7,
and reference is made to that figure for additional details.
Beginning at step 800, the variable E is loaded with the
block height H. Proceeding to step 802, it is determined
whether ADDFLAG is true, indicating whether the
interpolated source pixels will be averaged with the destination
pixels, or instead simply copied to the destination pixels. If
ADDFLAG is true, control proceeds to step 804, where it is
determined whether interlaced frames are being employed.
This is determined by comparing the variable LX with the
variable LX2. LX2 is equal to 2 times LX (or 2 times the
frame width) for interlaced frames and is equal to LX for
non-interlaced frames. If non-interlaced frames are being
used, control proceeds to step 806, where E, the loop count,
is divided by 4. This is done because for non-interlaced
frames, the loading efficiency permits multiple rows to be
vertically interpolated per pass. This is understood in
conjunction with the WIDTH16_VERTINT1_LOOP routine
810 discussed below in conjunction with FIGS. 12 and 13.
Proceeding to step 808, it is determined whether the width
is 16, corresponding to a luminance block. If so, control
proceeds to step 809, where E is further divided by 2. This
is because for 16-element square vertical interpolation, 8
rows are vertically interpolated per pass, whereas with
8-element square vertical interpolation, 4 rows are
interpolated per pass.
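The loop-count adjustment of steps 800 through 809 amounts to
the following arithmetic. This is a sketch of the described
flow, with the hypothetical function name vert_loop_count; it
assumes that in the interlaced case E is left equal to H, since
those loops handle one row per pass.

```python
def vert_loop_count(h, w, lx, lx2):
    """Number of loop passes for vertical interpolation of an h-row block."""
    e = h                 # step 800: E = block height
    if lx == lx2:         # step 804: non-interlaced when LX2 equals LX
        e //= 4           # step 806: at least 4 rows handled per pass
        if w == 16:       # step 808: luminance block
            e //= 2       # step 809: 8 rows handled per pass
    return e
```

For a 16x16 luminance block this gives 2 passes of 8 rows, and
for an 8x8 chrominance block 2 passes of 4 rows.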

Proceeding to step 810, the WIDTH16_VERTINT1_LOOP
routine 810 is executed. This is discussed
in conjunction with FIGS. 12 and 13, but to
summarize, it averages the first row with the second row of
the source block, and then averages the result with the first
row of the destination block. Then, it averages the
previously loaded second row with the third row of the source
block, and then averages the result with the second row of
the destination block. It repeats in a "flip-flopping" action,
interpolating rows 0 to 8, yielding destination block rows 0
to 7.

If at step 808 the width is not 16, then the width is 8,
indicating a chrominance block, so control proceeds to step
812 where a WIDTH8_VERTINT1_LOOP routine 812 is
executed, which operates similarly to the
WIDTH16_VERTINT1_LOOP routine 810, but only 4 rows are
vertically interpolated at a time.

Returning to step 804, if the encoding is interlaced,
control proceeds to step 814. At step 814, it is determined
whether the width is equal to 16, indicating a luminance
block, and if so, control proceeds to step 816 where a
WIDTH16_VERTINT2_LOOP routine 816 is executed.
Otherwise, if the width is 8, control proceeds to step 818,
where a WIDTH8_VERTINT2_LOOP routine 818 is executed.
Both the WIDTH16_VERTINT2_LOOP routine 816 and
the WIDTH8_VERTINT2_LOOP routine 818 only
interpolate one source block row with the next source block row
and then add to the destination block row at a time. That is,
they do not use the "flip-flop" loading of the routines 810 or
812.

Returning to step 802, if ADDFLAG is false, control 
proceeds to step 820. Then, a series of steps 820-836, 
corresponding to the steps 804-818, are executed. The 
difference between the steps 820-836 and steps 804-818 is 
that the source rows are not averaged with the destination 
rows, but are simply copied to the destination rows. This has 
been discussed above in conjunction with the
HORIZ_INTERPOLATION_ONLY routine 510 of FIG. 7. From the
loops 810, 812, 816, 818, 828, 830, 834, and 836, the routine 
508 returns to its calling routine in a done step 838. 

Turning to FIG. 12, a flowchart of the
WIDTH16_VERTINT1_LOOP routine 810 is shown. This is similar to
the WIDTH16_HORIZINT1_LOOP routine 606.
However, beginning with a step 850, a series of row offsets
are initially established. This is done by initially establishing
multiples of LX2, which will be used as offsets into each of
the first 9 rows of the source block pointed to by S, which
will then be used to generate the first 8 destination rows D.

Proceeding to step 852, a series of instructions VINTERP
are executed, as discussed below in conjunction with FIG.
13. These instructions perform the interpolation among the
first 9 rows of the source block S, and average those
interpolated results with the first 8 rows of the destination
block D.

Control then proceeds to step 854, where the source block 
S is incremented by 8 times LX2, and the same is done for 
the destination block. After this step, S points to the next 8 
rows of the source block and D points to the next 8 rows of 
the destination block. 

Proceeding to step 856, E is decremented and compared 
to zero in step 858. If E is not zero, additional rows remain 
to be interpolated, so control loops to step 850. Otherwise, 
the vertical interpolation is complete, so the routine 810 
returns to the calling routine at step 860. Of note, in step 860 
this would return to whatever routine called the 
MEURECON_COMP routine 500. 

The WIDTH16_VERTINT1_LOOP routine 810 is
illustrative of the other loops in FIG. 12. In FIG. 13, instead of
horizontally adjacent elements, adjacent rows are averaged
and then averaged into the destination buffer. But during
vertical interpolation, since after the first and second source
rows are averaged the second row remains loaded into a
vector, it does not need to be reloaded. This is understood in
conjunction with FIG. 13 and the source code appendix. In
step 902, V0 is loaded with the first row of the source block
designated by the pointer S, and in step 904, V1 is loaded
with the next row of the source block designated by S+LX.
Again, LX is the frame width, so adding the frame width to
S yields the start of the next row of the block. Then, the first
and second rows are averaged in a step 906, which both adds V0
and V1 and divides them by two simultaneously, yielding
the interpolated result in V0. This result is then averaged
with the first row of the destination block, which has been
loaded into V2 in a step 908. The result is then stored in step
910 to the first row of the destination block.

Proceeding to step 912, V0 is loaded with the row at S+2*LX,
which is the third row of the source macroblock. This is averaged
with V1 in step 914, because V1 was previously loaded with
the second row of the source block in step 904, so there is
no need to again load this row. The result of this average is
stored in V0, and is averaged with the second row of the
destination block in step 916 and then stored in step 918.
Proceeding to step 920, this time V1 is loaded with the
fourth row of the source block and averaged with V0, which
contains the third row of the source block, in step 922. The
result is then averaged in step 924 with the third row of the
destination macroblock and stored as that third row in step
926. This is repeated for a series of instructions 928, 930,
932, 934, and 936. It will be understood that alternately
either V0 or V1 is loaded with the next row of the source
block and then averaged with the other vector, which
contains the preceding row. This increases efficiency of loads
because each row need be loaded only a single time.
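The alternating ("flip-flop") use of the two source registers
can be sketched as below. This is an illustrative model, not
the appendix code: registers are lists, the interpolated result
is kept in a separate temporary so that each source row stays
intact for reuse, the row stride is passed as a parameter, and
the names add_shift and vinterp_add are hypothetical. The
function returns the number of source row loads so that the
saving can be seen directly.

```python
def add_shift(va, vb):
    """Model of the add-and-shift instruction: (a + b) >> 1 per slot."""
    return [(a + b) >> 1 for a, b in zip(va, vb)]


def vinterp_add(src, dst, s, d, stride, rows, width):
    """Vertically interpolate `rows` rows, loading each source row once."""
    regs = [src[s:s + width], None]   # V0 holds the first source row
    loads = 1
    for r in range(rows):
        nxt = (r + 1) % 2             # register that receives the next row
        row = s + (r + 1) * stride
        regs[nxt] = src[row:row + width]
        loads += 1
        interp = add_shift(regs[0], regs[1])   # average adjacent rows
        out = d + r * stride
        dst[out:out + width] = add_shift(interp, dst[out:out + width])
    return loads


src_buf = [0, 0, 9, 9, 10, 10, 9, 9, 20, 20, 9, 9]   # three rows, stride 4
dst_buf = [0] * 8
loads = vinterp_add(src_buf, dst_buf, 0, 0, 4, 2, 2)
```

Two output rows need only three source row loads here; for the
8 rows of a full pass, 9 loads replace the 16 that separate
per-row loading would require.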

It will be appreciated that the previous discussions of the
WIDTH16_VERTINT1_LOOP routine 810, the
WIDTH16_HORIZINT1_LOOP routine 606, and the
WIDTH8_HORIZINT2_LOOP routine 614 illustrate the
basic elements used for the other combinations of vertical
and horizontal interpolation. The
HORIZ_AND_VERT_INTERPOLATION routine 512 interpolates in both
directions, and the NO_INTERPOLATION routine 506
does no interpolation, but instead either directly copies the
source to the destination or averages the source with the
destination. It will be appreciated by those skilled in the art
how the elements of the above routines are used to create the
other permutations.

As will be appreciated by those skilled in the art, the 
MEURECON_COMP routine is useful within video 
decompression for ultimate display on a video monitor. It 
provides a high speed method to reconstruct video frames 
for display using standard MPEG format. That is, it can be 
used to display MPEG video through the video card 122 and 
on the screen 124. Further, the MEURECON_COMP rou- 
tine could be stored on the disk 118, in the ROM 112, in the 
RAM 114, on a CD ROM for the CD-ROM player 119, or
other media.

The foregoing disclosure and description of the invention 
are illustrative and explanatory thereof, and various changes 
in the size, shape, materials, components, circuit elements, 
wiring connections and contacts, as well as in the details of
the illustrated circuitry and construction and method of
operation may be made without departing from the spirit of 
the invention. 

What is claimed is: 

1. A method of providing MPEG motion compensation
using interpolation of a source block in a computer system,
the method comprising the steps of:

providing a vector processing unit with vector operand
routing, multiple operations per instruction, and an add
and divide instruction that adds two vector registers and
divides the result in a single instruction;

loading a first vector register with a first plurality of
elements from the source block;

loading a second vector register with a second plurality of
elements adjacent to the first plurality of elements; and

performing the add and divide instruction on the first and
second vector register yielding interpolated source
elements in a result vector register.

2. The method of claim 1 wherein the result vector register
is either the first or second vector register.

3. The method of claim 1 further comprising the step of:

storing the interpolated source elements to a destination
block.

4. The method of claim 3 further comprising the steps of:

loading a destination vector register with a plurality of
destination elements from a destination buffer;

performing the add and divide instruction on the result
vector register and the destination vector register
yielding a source-destination averaged result;

storing the source-destination averaged result to the
destination buffer.

5. The method of claim 1, wherein the first plurality of
elements are horizontally adjacent to the second plurality of
elements.

6. The method of claim 1, wherein the first plurality of
elements from a first row are vertically adjacent to the
second plurality of elements from a second row.

7. The method of claim 6 further comprising the steps of:

loading the first vector register with a next row from the
source block;

performing the add and divide instruction on the second
and first vector registers yielding a next plurality of
interpolated source elements.

8. The method of claim 7 further comprising the step of:

alternating the loading of the first and second vector
registers with a new next row from the source block and
repeating the performing step for a plurality of rows.

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9. The method of claim 1, wherein the divide is a divide 
by two. 

10. A computer system for providing MPEG motion 
compensation, the system comprising: 

a vector processing unit with vector operand routing, 5 
multiple operations per instruction, and an add and 
divide instruction that adds two vector registers and 
divides the result in a single instruction; 

means for loading a first vector register with a first 
plurality of elements from a source block; 10 

means for loading a second vector register with a second 
plurality of elements adjacent to the first plurality of 
elements; and 

means for performing the add and divide instruction on 15 
the first and second vector register yielding interpolated 
source elements in a result vector register. 

11. A computer program product for controlling a vector 
processing unit, the program comprising: 

a computer readable medium; 20 
means on said computer readable medium for causing a 
vector processing unit with vector operand routing, 
multiple operations per instruction, and an add and 
divide instruction that adds two vector registers and 
divides the result in a single instruction to load a first
vector register with a first plurality of elements from a 
source block; 

means on said computer readable medium for causing the 
vector processing unit to load a second vector register 
with a second plurality of elements adjacent to the first
plurality of elements; and 

means on said computer readable medium for causing the 
vector processing unit to perform the add and divide 
instruction on the first and second vector register
yielding interpolated source elements in a result vector
register. 

12. A system for drawing lines on a video display
comprising:

a processor; 

a multimedia extension unit coupled to the processor 
having operand routing, operation selection, and an add 
and divide instruction that adds two vector registers and 
divides the result in a single instruction; 

a video system; 

a first code segment for execution by said processor and 
said multimedia extension unit, said first code segment 
when executed causing said extension unit to load a 
first vector register with a first plurality of elements 
from a source block; 

a second code segment for execution by said processor 
and said multimedia extension unit, said second code 
segment when executed causing said extension unit to 
load a second vector register with a second plurality of 
elements adjacent to the first plurality of elements; and 

a third code segment for execution by said processor and 
said multimedia extension unit, said third code segment 
when executed causing said extension unit to perform 
the add and divide instruction on the first and second 
vector register yielding interpolated source elements in 
a result vector register, 

wherein said video system is for displaying MPEG video 
motion compensated by said first, second, and third 
code segments. 
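
Illustrative note (not part of the claims): the vertical interpolation
recited in claims 6 through 9 can be sketched in plain C. This is a
minimal scalar emulation under stated assumptions, not the program of
the microfiche appendix. The names VEC_LEN, add_and_divide, and
interpolate_vertical are hypothetical, the eight-element register width
is assumed, and the divide by two is assumed to truncate. Two
register-sized buffers stand in for the first and second vector
registers, and each new row overwrites whichever buffer holds the
older row.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VEC_LEN 8 /* assumed number of elements per vector register */

/* Hypothetical stand-in for the single add and divide instruction:
 * each lane of a and b is summed and the sum is divided by two
 * (truncating division is an assumption). */
static void add_and_divide(const int16_t *a, const int16_t *b,
                           int16_t *result)
{
    for (size_t i = 0; i < VEC_LEN; i++) {
        result[i] = (int16_t)((a[i] + b[i]) / 2);
    }
}

/* Interpolates vertically adjacent rows of a source block.
 * src holds `rows` rows spaced `stride` elements apart; out receives
 * rows - 1 interpolated rows of VEC_LEN elements each. */
static void interpolate_vertical(const int16_t *src, size_t stride,
                                 size_t rows, int16_t *out)
{
    int16_t reg[2][VEC_LEN]; /* emulated first and second registers */

    if (rows < 2) {
        return; /* nothing to interpolate */
    }

    /* Load the first and second registers with adjacent rows. */
    memcpy(reg[0], src, sizeof reg[0]);
    memcpy(reg[1], src + stride, sizeof reg[1]);
    add_and_divide(reg[0], reg[1], out);

    /* Alternate which register receives the next row, so the row
     * loaded on the previous step is reused without reloading. */
    for (size_t r = 2; r < rows; r++) {
        memcpy(reg[r & 1], src + r * stride, sizeof reg[0]);
        add_and_divide(reg[0], reg[1], out + (r - 1) * VEC_LEN);
    }
}
```

Because the addition is commutative, the order of the two register
operands does not affect the result, which is why alternating the
load target is sufficient. For three source rows holding 10, 20, and
41 in every element, the two output rows hold 15 and 30.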

***** 